ABSTRACT


A short proof is given of the convergence (in a finite number of
steps) of an algorithm for adjusting weights in a single-threshold
device. The algorithm in question can be interpreted as the
error-correction procedure introduced by Rosenblatt for his "α-Perceptron."
The proof presented extends the basic idea to continuous as well as
discrete cases, and is interpreted geometrically.
I INTRODUCTION


The purpose of this report is to exhibit an extremely short and,
more notably, transparent proof of a theorem concerning perceptrons.

The theorem itself must now be considered one of the most basic theorems
about perceptrons, and indeed, is among the first theorems proved by
Rosenblatt and his collaborators. It also enjoys the peculiar distinction
of being one of the most often re-proved results in the field. The
succession of proofs now available progresses from somewhat cloudy
statements (which at one time caused doubt among "reasonable men" that
the theorem was true) to comparatively crisp statements of a purely
mathematical nature which nonetheless use more print than is strictly
necessary.

More to the point, latter-day proofs fail to enunciate a simple
principle involved. This principle permits one to modify the hypotheses
in a variety of ways and secure similar results; it may well be useful
in establishing genuinely new theorems of like character. We therefore
present our proof in its entirety, in part to verify our claim that it
is as short a line as can be drawn from hypotheses to conclusion, and
also with the hope of terminating an already lengthy process of
successive refinements. In addition, we prove a related theorem which is
the continuous analogue of the perceptron theorem, and we indicate that
various other theorems may be obtained by appropriately modifying the
hypotheses. We also discuss a geometrical interpretation of the
perceptron theorem in terms of a convex cone and its dual.
II STATEMENT OF THE THEOREM


Whereas previous proofs of the theorem appealed to a structure,
called by Rosenblatt and his co-workers an α-perceptron, the present
theorem, proof, and discussion apply without modification to a
structure consisting of a single threshold element acting on a weighted
set of inputs.

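Theorem. Let $w_1, \ldots, w_N$ be a set of vectors in a Euclidean space of
fixed finite dimension, satisfying the hypothesis that there exists a
vector $y$ such that

\[ (w_i, y) > \theta > 0, \qquad i = 1, \ldots, N. \tag{1} \]

Consider the infinite sequence $w_{i_1}, w_{i_2}, w_{i_3}, \ldots$, with
$1 \le i_k \le N$ for every $k$, such that each vector $w_i$ occurs
infinitely often. Recursively construct a sequence of vectors
$v_0, v_1, \ldots, v_n, \ldots$ as follows: $v_0$ is arbitrary, and

\[ v_n = \begin{cases} v_{n-1} & \text{if } (w_{i_n}, v_{n-1}) > \theta, \\ v_{n-1} + w_{i_n} & \text{if } (w_{i_n}, v_{n-1}) \le \theta. \end{cases} \tag{2} \]

The sequence $\{v_n\}$ is convergent; i.e., for some index $m$,

\[ v_m = v_{m+1} = v_{m+2} = \cdots = v. \]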
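The recursion (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the training sequence is taken to be the cyclic one $w_1, \ldots, w_N, w_1, \ldots$ (one simple way of making each $w_i$ occur infinitely often), and the names `dot`, `error_correction`, and `max_steps` are hypothetical, not part of the report.

```python
from itertools import cycle


def dot(a, b):
    """Euclidean inner product (a, b)."""
    return sum(x * y for x, y in zip(a, b))


def error_correction(ws, theta, v0, max_steps=100_000):
    """Apply recursion (2) to the cyclic sequence w_1, ..., w_N, w_1, ...

    Returns the final vector v and the number of corrections made.
    The cyclic ordering and the step limit are assumptions of this sketch.
    """
    v = list(v0)
    corrections = 0
    quiet = 0  # consecutive presentations that needed no correction
    for step, w in enumerate(cycle(ws)):
        if step >= max_steps:
            raise RuntimeError("no convergence; hypothesis (1) may fail")
        if dot(w, v) > theta:
            quiet += 1
            if quiet == len(ws):  # a full pass without change: v has stabilized
                return v, corrections
        else:
            v = [vi + wi for vi, wi in zip(v, w)]  # v_n = v_{n-1} + w_{i_n}
            corrections += 1
            quiet = 0


# Example data: y = (2, 2) satisfies (1) with theta = 1.
ws = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
v, corrections = error_correction(ws, theta=1.0, v0=(0.0, 0.0))
print(v, corrections)  # [2.0, 2.0] 4
```

On this small example the procedure stops after four corrections, and the final vector satisfies $(w_i, v) > \theta$ for every $i$.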


Remarks:

(1) In particular, the theorem insures that $(w_i, v) > \theta$ for
$i = 1, \ldots, N$, since each $w_i$ occurs arbitrarily far out in the
sequence $\{w_{i_n}\}$. It is only to obtain this consequence that we
impose the restriction that each $w_i$ occurs infinitely often in the
training sequence.


(2) Theoretically, we may take $\theta = 0$ without loss of generality.*
However, this often has the effect of smuggling in numbers of large
magnitude. For this reason, we retain the general $\theta$, but in the
concluding section we do consider the relation between the general case
and the case $\theta = 0$.

(3) In the private language of perceptron workers, the theorem
reads as follows: A set of incoming signals is divided into two adjacent
classes. A "satisfactory" assignment of weights from the associator
units is defined as an assignment resulting in a response +1 for signals
of Class I, and -1 for signals of Class II. The theorem asserts that
no matter what assignment of weights we begin with, the process of
recursively readjusting the weights by the method known as "error
correction" will terminate after a finite number of corrections in a
satisfactory assignment, provided such a satisfactory assignment exists.
More briefly, a finite number of corrections will teach the perceptron to
perform any given dichotomy of signals, if the dichotomy is within the
capacity of the perceptron at all.

The definition of α-perceptron and the precise correspondence between
the theorem's original verbal description and the purely mathematical
assertion of the above theorem are provided by Block.* A brief glossary
indicating the correspondence follows: the vector $w_i$ represents the
activity of the associators, including class information, when stimulus
$S_i$ is presented. The vector $y$, which we assume to exist, represents a
"satisfactory" assignment of associator weights; $y$ has as many components
as there are associators. The sequence $\{w_{i_n}\}$ represents the
"training sequence," and the rule for defining $v_n$ describes the
error-correction procedure. The positive number $\theta$ is a threshold
which must be exceeded for the response of the perceptron to be correct;
a vector $v$ such that $(w_i, v) > \theta$ is an assignment of associator
outputs which successfully classifies the $i$th signal. [If
$(w_i, v) \le \theta$, then either the $i$th signal has been classified as
belonging to the incorrect class or the perceptron has refrained from
commitment, depending on whether or not the inequality is strict.]

(4) It is clear that because of Eq. (2), as $n$ varies, the sequence
$\{v_n\}$ changes, if at all, only by the addition of one or another of
the set $w_1, \ldots, w_N$. For this reason "convergence" implies
"convergence in a finite number of steps." The word "stabilizes" has been
suggested to describe this kind of convergence.
III PROOF OF THEOREM


We may omit from the training sequence all terms $w_{i_n}$ for which
$v_n = v_{n-1}$; these are clearly inessential. The new training
sequence is such that correction takes place at every step. Adjusting
our notation, we may assume that

\[ v_n = v_{n-1} + w_{i_n} \quad \text{and} \quad (w_{i_n}, v_{n-1}) \le \theta \qquad \text{for each } n. \tag{3} \]

We observe that $n$ is the number of corrections made up to the $n$th step.
The assertion of the theorem after this change of notation is that $n$
can range only through a finite set of integers; that is, conditions (1)
and (3) cannot continue to hold simultaneously for all $n = 1, 2, 3, \ldots$.

First we show that inequality (1) alone implies the inequality

\[ \|v_n\|^2 > C n^2 \tag{4} \]

for suitable choice of the positive constant $C$, and $n$ sufficiently large.
Since $v_n = v_0 + w_{i_1} + \cdots + w_{i_n}$, inequality (1) implies that
$v_n$ satisfies

\[ (v_n, y) > (v_0, y) + n\theta. \]

Using the Cauchy-Schwarz inequality (and squaring, which is legitimate
whenever $(v_0, y) + n\theta \ge 0$),

\[ \|v_n\|^2 \ge \frac{(v_n, y)^2}{\|y\|^2} > \frac{[(v_0, y) + n\theta]^2}{\|y\|^2}. \]

If $(v_0, y) \ge 0$, we may choose $C = \theta^2 / \|y\|^2$ and inequality (4)
is satisfied for all $n$. If $(v_0, y) < 0$, we may choose
$C = (1/4)(\theta^2 / \|y\|^2)$, and inequality (4) is satisfied for
$n > -2[(v_0, y)/\theta]$.

On the other hand, we show that inequalities (3) alone imply the
inequality

\[ \|v_n\|^2 \le \|v_0\|^2 + (2\theta + M)\, n, \tag{5} \]

where

\[ M = \max_{i = 1, \ldots, N} \|w_i\|^2. \]

Using inequality (3), the integer-argument function $\|v_k\|^2$ satisfies for
each $k$ the difference inequality

\[ \|v_k\|^2 - \|v_{k-1}\|^2 = 2 (w_{i_k}, v_{k-1}) + \|w_{i_k}\|^2 \le 2\theta + M. \]

Adding the inequalities for $k = 1, 2, \ldots, n$, we obtain inequality (5).

Clearly,  inequalities  (4)  and  (5)  are  incompatible  for  n  sufficiently 
large. 
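As a numerical illustration of why (4) and (5) collide, consider the special case $v_0 = 0$, an assumption made here only to keep the arithmetic short. Then $(v_n, y) > n\theta$, so $n^2\theta^2/\|y\|^2 < \|v_n\|^2 \le (2\theta + M)\,n$, which forces $n < (2\theta + M)\|y\|^2/\theta^2$. The Python sketch below checks both bounds after every correction; the function name `check_bounds`, the rule of always correcting with the first violating $w_i$, and the example data are hypothetical.

```python
def dot(a, b):
    """Euclidean inner product (a, b)."""
    return sum(x * y for x, y in zip(a, b))


def check_bounds(ws, y, theta, max_corrections=100_000):
    """Run corrections from v_0 = 0 and verify the two growth bounds.

    Returns (n, bound): the number of corrections actually made and the
    ceiling (2*theta + M) * |y|^2 / theta^2 obtained by comparing the bounds.
    """
    if any(dot(w, y) <= theta for w in ws):
        raise ValueError("y does not satisfy hypothesis (1)")
    M = max(dot(w, w) for w in ws)
    yy = dot(y, y)
    bound = (2 * theta + M) * yy / theta ** 2
    v = [0.0] * len(y)
    n = 0
    while n < max_corrections:
        # Irredundant training sequence: every step is a correction.
        w = next((w for w in ws if dot(w, v) <= theta), None)
        if w is None:
            return n, bound
        v = [vi + wi for vi, wi in zip(v, w)]
        n += 1
        lower = (n * theta) ** 2 / yy   # quadratic lower bound, v_0 = 0
        upper = (2 * theta + M) * n     # linear upper bound, v_0 = 0
        assert lower < dot(v, v) <= upper
    raise RuntimeError("correction limit reached")


ws = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
n, bound = check_bounds(ws, y=(2.0, 2.0), theta=1.0)
print(n, bound)  # 4 32.0
```

The ceiling is usually far from tight (here 4 corrections against a bound of 32); it depends on the particular $y$ chosen, and any $y$ satisfying (1) gives a valid bound.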




IV AN ANALOGOUS THEOREM


The theorem of Section II is the discrete analogue of the following
theorem, which may seem more intuitive: Let $v(t)$ be a curve in Euclidean
$n$-space described by a smooth vector function of the continuous variable
$t$, such that there exists a vector $y$ such that

\[ \left( \frac{dv}{dt}, y \right) > C > 0 \tag{1'} \]

and

\[ \frac{1}{2} \frac{d}{dt} \|v(t)\|^2 = \left( \frac{dv}{dt}, v(t) \right) < \theta, \qquad 0 \le t \le \delta. \tag{3'} \]

There exists an upper bound for $\delta$; in particular, inequalities (1)' and
(3)' are compatible only over a finite domain on the $t$-axis.

The proof is virtually identical with that of the discrete case.
Integrating inequality (1)' from 0 to $t$, we obtain

\[ [v(t), y] > [v(0), y] + Ct. \tag{6} \]

Integrating inequality (3)' from 0 to $t$, we obtain

\[ \|v(t)\|^2 < 2\theta t + \|v(0)\|^2. \tag{7} \]

Using the Cauchy-Schwarz inequality and inequality (6), we obtain

\[ \|v(t)\|^2 \ge \frac{[v(t), y]^2}{\|y\|^2} > \frac{\{[v(0), y] + Ct\}^2}{\|y\|^2}. \tag{8} \]

Inequalities (7) and (8) together show that $t$ cannot exceed the larger
root of the quadratic

\[ \{[v(0), y] + Ct\}^2 = \|y\|^2 \left\{ 2\theta t + \|v(0)\|^2 \right\}. \]
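The larger root can be written out explicitly. Putting $a = [v(0), y]$, the quadratic becomes $C^2 t^2 + (2aC - 2\theta\|y\|^2)\,t + (a^2 - \|y\|^2\|v(0)\|^2) = 0$; its constant term is non-positive by the Cauchy-Schwarz inequality, so the roots are real and the larger one is non-negative. A minimal Python sketch follows; the name `time_bound` and the example values are hypothetical.

```python
import math


def dot(a, b):
    """Euclidean inner product (a, b)."""
    return sum(x * y for x, y in zip(a, b))


def time_bound(v0, y, C, theta):
    """Larger root of {(v0, y) + C t}^2 = |y|^2 {2 theta t + |v0|^2}."""
    a = dot(v0, y)
    yy = dot(y, y)
    A = C * C
    B = 2 * a * C - 2 * theta * yy
    K = a * a - yy * dot(v0, v0)       # <= 0 by Cauchy-Schwarz
    disc = max(B * B - 4 * A * K, 0.0)  # guard against rounding error
    return (-B + math.sqrt(disc)) / (2 * A)


# Start at the origin with y = (1, 0), C = 1, theta = 1.
print(time_bound((0.0, 0.0), (1.0, 0.0), C=1.0, theta=1.0))  # 2.0
```

In this example the quadratic reduces to $t^2 = 2t$, so the two inequalities cannot both hold beyond $t = 2$.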
Remarks:


(1) Inequality (1)' means that the tangent vector to the curve lies
on one side of a hyperplane. Inequality (3)' means that the rate of growth
of $\|v\|^2$ is bounded.

(2) We may compare the above argument for the continuous case with
the extremely familiar phenomenon that

\[ \left( \frac{dv}{dt}, v(t) \right) = 0 \tag{9} \]

implies

\[ \|v(t)\|^2 \ \text{is constant}; \tag{10} \]

i.e., a curve whose tangent vector is always perpendicular to its
position vector is constrained to lie on a sphere. Replacing the
orthogonality condition (9) with the inequality (3)' results in an
inequality (7) on the rate of growth of the function $f(t) = \|v(t)\|^2$
which is clearly a weakening of the condition (10) that $f(t)$ be constant.

(3) The principle involved in the theorem is the following: The
condition of (3)', that the tangent vector have bounded scalar product
with the position vector, clearly results in an upper bound for the
instantaneous position of the curve as a function of time. On the other
hand, if the tangent vectors $dv/dt$ to the curve remain sufficiently large
and do not depart too badly from colinearity, as prescribed, for example,
by (1)', then a lower bound on the cumulative growth results, as in (8).
This is intuitively clear: If $dv/dt$ does not get too small, the total
arclength will increase with at least a certain rate. If, on the other
hand, the $dv/dt$ are sufficiently "nearly colinear," then the serpentine
path swept out by $v(t)$ cannot reverse its direction enough to prevent
its over-all migration away from its starting point. The opposition of
these two influences implies the termination of one of the two relations
(1)' or (3)'.

We will not dwell upon the matter of how assorted variations of this
theme will continue to produce assertions that $t$, or, in the discrete
case, $n$, must remain bounded. Whether each of these deserves to be
dignified with the name theorem is a moot point.
V GEOMETRICAL INTERPRETATION


We conclude with a few words about the geometrical interpretation
of the assumption (1). We assume familiarity with the theory of convex
sets in Euclidean vector spaces. For the most part, we state these
remarks without proof. For a general introduction to this theory, see
Blackwell and Girshick, and Gale.

The polyhedral cone, $C$, with generators $w_1, \ldots, w_N$ is defined as
all vectors of the form $\lambda_1 w_1 + \cdots + \lambda_N w_N$ where
$\lambda_i \ge 0$, $i = 1, 2, \ldots, N$. The cone $C$ is called proper if
for all $v \ne 0$, $C$ never contains both $v$ and $-v$; or equivalently, if
$C$, apart from its vertex, lies in the interior of a half space. The
condition that there exists a vector $y$ satisfying (1) is precisely
equivalent to the condition that $C$ is a proper cone.

We remark that requiring the existence of a vector $y$ which satisfies
(1) with $\theta > 0$ is neither stronger nor weaker than requiring the
existence of a vector $\bar{y}$ which satisfies (1) with $\theta = 0$, i.e.,
which satisfies

\[ (w_i, \bar{y}) > 0, \qquad i = 1, \ldots, N. \tag{1''} \]

Indeed, $y$ itself can serve for $\bar{y}$; conversely, given $\theta$ and
$\bar{y}$ satisfying (1)'', any sufficiently large positive multiple
$y = \lambda \bar{y}$ of $\bar{y}$ with

\[ \lambda > \frac{\theta}{\min_{i = 1, \ldots, N} (w_i, \bar{y})} \]

will satisfy (1). Condition (1)'' is the customary way of specifying
that the cone $C$ be proper.
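The rescaling step is simple enough to spell out. The following Python sketch takes a vector with all inner products $(w_i, \bar{y})$ positive and returns a multiple of it that clears the threshold $\theta$. The function name `scale_to_margin` and the safety `factor` (any value greater than 1 works) are assumptions of this illustration.

```python
def dot(a, b):
    """Euclidean inner product (a, b)."""
    return sum(x * y for x, y in zip(a, b))


def scale_to_margin(ws, y_bar, theta, factor=2.0):
    """Return lam * y_bar with lam > theta / min_i (w_i, y_bar)."""
    smallest = min(dot(w, y_bar) for w in ws)
    if smallest <= 0:
        raise ValueError("y_bar must have positive inner product with every w_i")
    if factor <= 1:
        raise ValueError("factor must exceed 1 so that the inequality is strict")
    lam = factor * theta / smallest
    return [lam * c for c in y_bar]


ws = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
y = scale_to_margin(ws, y_bar=(1.0, 1.0), theta=1.0)
print(y)  # [2.0, 2.0]
```

Larger thresholds simply demand longer vectors, which is the sense in which taking $\theta = 0$ loses no generality.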

For the continuous analogue, requiring that infinitely many vectors
$dv/dt$ satisfy (1)' with $C > 0$ is actually stronger than requiring only
that $dv/dt$ satisfy

\[ \left( \frac{dv}{dt}, y \right) > 0, \qquad 0 \le t < \delta. \tag{1'''} \]

In fact, the left-hand side, though positive for each $t$, need not be
bounded away from zero. If we assume (1)''' to hold for $0 \le t \le \delta$
(equality permitted at $\delta$) and $dv/dt$ to be continuous, then, as in the
discrete case, it is true that (1)''' and (1)' are precisely equivalent.

The cone $C^*$ of all vectors $v$ such that $(w, v) \ge 0$ for all $w$ in $C$,
or equivalently such that

\[ (w_i, v) \ge 0, \qquad i = 1, \ldots, N, \tag{10} \]

where $C$ is the polyhedral cone generated by $w_1, \ldots, w_N$, is called the
dual cone of $C$. Its interior consists of all $v$ for which every
inequality in (10) is strict. The bigger $C$ is the smaller $C^*$ is, and
vice versa; for example, when $C$ is a half-space, $C^*$ is a half-line. In
general, for $n \ge 2$ neither need include the other. On the other hand,
it is not possible to weakly separate $C$ and $C^*$; i.e., there is no
$z \ne 0$ such that both $(w, z) \ge 0$ for all $w$ in $C$, and
$(v, z) \le 0$ for all $v$ in $C^*$. If such a $z$ did exist, then by the
first inequality, $z$ is in $C^*$; then choosing $v$ to be $z$ in the second
inequality implies that $(z, z) \le 0$, i.e., that $z = 0$, contrary to the
assumption that $z \ne 0$.

When $C$ is proper, $C^*$ has an interior; indeed, the $\bar{y}$ of (1)'' is in
the interior of $C^*$. If $C^*$ has an interior, $C$ and the interior of $C^*$
overlap; if not, then a consequence would be that $C$ and $C^*$ could be
weakly separated by a classic result on the separation of convex sets.
This we have seen to be impossible.

As previously observed, if $\bar{y}$ is in the interior of $C^*$, then a
suitable positive multiple of $\bar{y}$ will satisfy (1). Let the set of
vectors satisfying (1) be denoted by $D$. $D$ is a subset of the interior
of $C^*$.

The relation between (1) and the dual cone defined by (10) may be
summarized as follows: (1) has a solution (i.e., $D$ is non-empty) if and
only if $C^*$ has an interior.
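For a finite set of generators, membership in the dual cone, in its interior, and in $D$ all reduce to checking $N$ inner products. The Python sketch below makes the three tests explicit; the function names `in_dual_cone`, `in_dual_interior`, and `in_D` are hypothetical labels for the conditions stated in the text.

```python
def dot(a, b):
    """Euclidean inner product (a, b)."""
    return sum(x * y for x, y in zip(a, b))


def in_dual_cone(ws, v):
    """v lies in C*: (w_i, v) >= 0 for every generator w_i."""
    return all(dot(w, v) >= 0 for w in ws)


def in_dual_interior(ws, v):
    """v lies in the interior of C*: every inequality is strict."""
    return all(dot(w, v) > 0 for w in ws)


def in_D(ws, v, theta):
    """v satisfies (1): (w_i, v) > theta for every generator w_i."""
    return all(dot(w, v) > theta for w in ws)


ws = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
boundary = (1.0, 0.0)   # in C* but orthogonal to w_2, so not interior
interior = (1.0, 1.0)   # interior of C*, but not in D for theta = 1
solution = (2.0, 2.0)   # a positive multiple of `interior` that lies in D

print(in_dual_cone(ws, boundary), in_dual_interior(ws, boundary))        # True False
print(in_dual_interior(ws, interior), in_D(ws, interior, theta=1.0))     # True False
print(in_D(ws, solution, theta=1.0))                                     # True
```

The three sample vectors illustrate the chain of inclusions: $D$ sits inside the interior of $C^*$, which sits inside $C^*$.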

The error-correction procedure is a recursive construction of a
vector in $D$ of the form

\[ n_1 w_1 + \cdots + n_N w_N, \]

where $n_i$ is the number of times $w_i$ occurred in the irredundant training
sequence before the termination of the correction process. The mere
existence of such a vector (without a construction algorithm) is assured
by the fact that $C$ and $D$ overlap, which is a consequence of the
above-mentioned overlap between $C$ and the interior of $C^*$.

In general, if condition (1) is fulfilled, so that $w_1, \ldots, w_N$
generate a proper cone, it will happen that some subset of the $w$'s will
generate the same cone (which then has the same dual $C^*$), and it suffices
to restrict attention to this subset in constructing the training sequence.
To accelerate the termination of the correction process one should use
for correction those $w$'s which themselves are nearest the interior of $C^*$,
and which are as long as possible. If, for example, $w_k = w_i + w_j$, the
single addition of $w_k$ will accomplish as much as the successive additions
of $w_i$ and $w_j$. The question of whether a dichotomy is within the
combinatorial capacity of an α-type perceptron reduces to whether or not $C^*$
has an interior, or equivalently, whether or not $C$ is proper. This
question is discussed in a somewhat different context by Joseph and Hay,
and Keller.*

After the work was completed it was pointed out that S. Agmon had
previously considered a variety of similar correction procedures (none,
however, identical with the above) and showed their convergence in general
to be at a geometric rate.